SOAP: Efficient Feature Selection of Numeric Attributes

Author: C. A. R. Hoare
G. Pagallo
H. Almuallim
J. Quinlan
R. Kohavi
R. Setiono
Publication date: 01/01/2002

Attribute selection techniques for supervised learning, used in the preprocessing phase to emphasize the most relevant attributes, make classification models simpler and easier to understand. Depending on the method applied (starting point, search organization, evaluation strategy, and stopping criterion), they add a cost to the classification algorithm that is normally compensated, to a greater or lesser extent, by the attribute reduction in the classification model. The algorithm SOAP (Selection of Attributes by Projection) has some interesting characteristics: a lower computational cost (O(mn log n) for m attributes and n examples in the data set) than other typical algorithms, owing to the absence of distance and statistical calculations, and no need for transformation. The performance of SOAP is analysed in two ways: percentage of reduction and classification. SOAP has been compared to CFS [6] and ReliefF [11]. The results are generated by C4.5 and 1NN before and after the application of the algorithms.

CiteSeerX

Crossref

idUS. Depósito de Investigación Universidad de Sevilla

A comparison of model validation techniques for audio-visual speech recognition

Author: E Kocaguneli
G Bradski
H Li
K Chauhan
MZ Ibrahim
P Kakumanu
R Kohavi
Publication date: 01/01/2017

This paper implements and compares the performance of a number of techniques proposed for improving the accuracy of Automatic Speech Recognition (ASR) systems. As ASR that uses only speech can be contaminated by environmental noise, in some applications it may improve performance to employ Audio-Visual Speech Recognition (AVSR), in which recognition uses both audio information and mouth movements obtained from a video recording of the speaker’s face region. In this paper, model validation techniques, namely the holdout method, leave-one-out cross validation and bootstrap validation, are implemented to validate the performance of an AVSR system as well as to provide a comparison of the performance of the validation techniques themselves. A new speech data corpus is used, namely the Loughborough University Audio-Visual (LUNA-V) dataset, which contains 10 speakers with five sets of samples uttered by each speaker. The database is divided into training and testing sets and processed in manners suitable for the validation techniques under investigation. The performance is evaluated over a range of signal-to-noise ratio values with a variety of noise types obtained from the NOISEX-92 dataset.
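The three validation schemes named in this abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function names, the toy one-dimensional data and the nearest-centroid classifier are all assumptions made for the example, and the bootstrap variant shown here tests on the out-of-bag samples, which is one common choice among several.

```python
import random

def holdout_split(n, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) for a single random holdout split."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, int(round(n * test_fraction)))
    return idx[n_test:], idx[:n_test]

def leave_one_out_splits(n):
    """Yield n splits; each holds out exactly one sample for testing."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def bootstrap_splits(n, n_rounds=10, seed=0):
    """Yield splits that train on n samples drawn with replacement and
    test on the samples that were never drawn (out-of-bag)."""
    rng = random.Random(seed)
    for _ in range(n_rounds):
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        test = [j for j in range(n) if j not in drawn]
        if test:  # skip the rare round in which every sample was drawn
            yield train, test

def fit_centroids(X, y):
    """Toy classifier (assumption): store the mean value of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_centroid(model, x):
    """Predict the class whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def accuracy(splits, fit, predict, X, y):
    """Average test accuracy of a classifier over a sequence of splits."""
    scores = []
    for train, test in splits:
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# Hypothetical data: two well-separated classes of ten samples each.
X = [i / 10 for i in range(10)] + [5 + i / 10 for i in range(10)]
y = [0] * 10 + [1] * 10

if __name__ == "__main__":
    n = len(X)
    print("holdout      :", accuracy([holdout_split(n)], fit_centroids, predict_centroid, X, y))
    print("leave-one-out:", accuracy(leave_one_out_splits(n), fit_centroids, predict_centroid, X, y))
    print("bootstrap    :", accuracy(bootstrap_splits(n), fit_centroids, predict_centroid, X, y))
```

The schemes differ mainly in how many models are trained and how much data each one sees: holdout trains once, leave-one-out trains once per sample, and the bootstrap trains once per resampling round.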

Loughborough University Institutional Repository

Crossref

Optimization experiments in the continuous space: The limited growth optimistic optimization algorithm

Author: A Savitzky
DI Mattos
G Schermann
G Tamburrelli
GL Urban
L Bottou
N Juristo
R Kohavi
RS Sutton
S Bubeck
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Online controlled experiments are extensively used by web-facing companies to validate and optimize their systems, providing a competitive advantage in their business. As the number of experiments scales, companies aim to invest their experimentation resources in larger feature changes and leave the automated techniques to optimize smaller features. Optimization experiments in the continuous space are encompassed in the many-armed bandits class of problems. Although previous research provides algorithms for solving this class of problems, these algorithms were not implemented in real-world online experimentation problems and do not consider the application constraints, such as time to compute a solution, selection of a best arm and the estimation of the mean-reward function. This work discusses online experiments in the context of the many-armed bandits class of problems and provides three main contributions: (1) an algorithm modification to include online experiment constraints, (2) implementation of this algorithm in an industrial setting in collaboration with Sony Mobile, and (3) statistical evidence that supports the modification of the algorithm for online experiment scenarios. These contributions support the relevance of the LG-HOO algorithm in the context of optimization experiments and show how the algorithm can be used to support continuous optimization of online systems in stochastic scenarios.

Crossref

Chalmers Research

Different Approaches to Community Evolution Prediction in Blogosphere

Author: Davis D.
Gliwa B.
John G. H.
Kohavi R.
Quinlan R.
Richter Y.
Wasserman S.
Zheleva E.
Zygmunt A.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/06/2013

Predicting the future direction of community evolution is a problem of high theoretical and practical significance. It makes it possible to determine which characteristics describing communities matter for their future behaviour. Knowledge about the probable future career of a community aids decisions about investing in contact with its members and about carrying out actions to achieve a key position in it. It also makes it possible to determine effective ways of forming opinions, or to protect group participants against such activities. In the paper, a new approach to group identification and prediction of future events is presented, together with a comparison to an existing method. The experiments performed show a high quality of prediction results. Comparison to previous studies shows that using many measures to describe the group profile, and consequently as classifier input, can improve predictions. Comment: SNAA2013 at ASONAM2013, IEEE Computer Society

arXiv.org e-Print Archive

Crossref

Shaping electron wave functions in a carbon nanotube with a parallel magnetic field

Author: Andrew J. O. Whitehouse
Bhargava N.
Conti-Ramsden G.
Cruz J. A.
David A. Copland
Efron B.
Fenson L.
Frankenburg W. K.
Hall M. A.
Hu X.
James G. Scott
Katie L. McMahon
Kohavi R.
Kotthoff L.
Lebarton E. S.
Liu S.
Martyn Symons
Quinlan J. R.
Rebecca Armstrong
Semel W.
Squires J.
Straker L.
Wendy L. Arnott
Publication date: 01/01/2018

A magnetic field, through its vector potential, usually causes measurable changes in the electron wave function only in the direction transverse to the field. Here we demonstrate experimentally and theoretically that in carbon nanotube quantum dots, which combine cylindrical topology and a bipartite hexagonal lattice, a magnetic field along the nanotube axis also affects the longitudinal profile of the electronic states. With the high (up to 17 T) magnetic fields in our experiment, the wave functions can be tuned all the way from a "half-wave resonator" shape, with nodes at both ends, to a "quarter-wave resonator" shape, with an antinode at one end. This in turn causes a distinct dependence of the conductance on the magnetic field. Our results demonstrate a new strategy for the control of wave functions using magnetic fields in quantum systems with nontrivial lattice and topology. Comment: 5 figures

arXiv.org e-Print Archive

University of Regensburg Publication Server

Crossref

Queensland University of Technology ePrints Archive

University of Queensland eSpace

FigShare

Insights into the feature selection problem using local optima networks

Author: A Ben-David
B Xue
CM Reidys
D Aha
E Amaldi
E Hancer
G Chandrashekar
G Ochoa
KM Malan
M Hall
R Kohavi
S Vérel
TM Cover
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

The binary feature selection problem is investigated in this paper. A fitness landscape analysis of feature selection is carried out, which allows for a better understanding of the behaviour of feature selection algorithms. Local optima networks are employed as a tool to visualise and characterise the fitness landscapes of the feature selection problem in the context of classification. An analysis of the global structure of the fitness landscape is provided, based on seven real-world datasets with up to 17 features. The formation of neutral global optima plateaus is shown to indicate the existence of irrelevant features in the datasets. Removal of irrelevant features reduced both the neutrality and the ratio of local optima to the size of the search space, resulting in improved performance of genetic algorithm search in finding the global optimum.

Crossref

Stirling Online Research Repository (RIOXX)

Stirling Online Research Repository

Assisted Diagnosis of Parkinsonism Based on the Striatal Morphology

Author: Diego Castillo-Barnes
Fermín Segovia
Francisco J. Martínez-Murcia
Greenberg D.
Hochberg Y.
Javier Ramírez
Juan M. Górriz
Kohavi R.
Mammone N.
Sáez G.
Theodoridis S.
Towey D. J.
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/01/2019

Parkinsonism is a clinical syndrome characterized by the progressive loss of striatal dopamine. Its diagnosis is usually corroborated by neuroimaging data such as DaTSCAN neuroimages, which allow the possible dopamine deficiency to be visualized. During the last decade, a number of computer systems have been proposed to automatically analyze DaTSCAN neuroimages, eliminating the subjectivity inherent in the visual examination of the data. In this work, we propose a computer system based on machine learning to separate Parkinsonian patients and control subjects using the size and shape of the striatal region, modeled from DaTSCAN data. First, an algorithm based on adaptive thresholding is used to parcel the striatum. This region is then divided into two according to the brain hemisphere division and characterized with 152 measures, extracted from the volume and its three possible 2-dimensional projections. Afterwards, the Bhattacharyya distance is used to discard the least discriminative measures and, finally, the neuroimage category is estimated by means of a Support Vector Machine classifier. This method was evaluated using a dataset with 189 DaTSCAN neuroimages, obtaining an accuracy rate over 94%. This rate outperforms those obtained by previous approaches that use the intensity of each striatal voxel as a feature. This work was supported by the MINECO/FEDER under the TEC2015-64718-R project, the Ministry of Economy, Innovation, Science and Employment of the Junta de Andalucía under the P11-TIC-7103 Excellence Project and the Vicerectorate of Research and Knowledge Transfer of the University of Granada.
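The feature-filtering step mentioned in this abstract (ranking measures by Bhattacharyya distance and discarding the least discriminative ones) might look roughly like the sketch below. It assumes each measure is modelled as a univariate Gaussian per class, which the abstract does not state; the function names and the tiny data set are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, pvariance

def bhattacharyya_gaussian(a, b):
    """Bhattacharyya distance between two samples, each modelled as a
    univariate Gaussian (an assumption of this sketch).
    Both samples must have non-zero variance."""
    m1, m2 = mean(a), mean(b)
    v1, v2 = pvariance(a), pvariance(b)
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term

def rank_features(class_a, class_b):
    """Return feature indices ordered from most to least discriminative.
    class_a and class_b are lists of rows, one row per subject."""
    n_features = len(class_a[0])
    scores = [
        bhattacharyya_gaussian([row[j] for row in class_a],
                               [row[j] for row in class_b])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Hypothetical measurements: feature 0 separates the groups, feature 1 does not.
patients = [[1.0, 5.0], [1.2, 5.1], [0.8, 4.9]]
controls = [[3.0, 5.0], [3.1, 5.2], [2.9, 4.8]]

if __name__ == "__main__":
    order = rank_features(patients, controls)
    print("features, most discriminative first:", order)
    print("kept after discarding the last one:", order[:-1])
```

A classifier such as a Support Vector Machine would then be trained only on the retained columns.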

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Granada

The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

Author: A Ivshina
Anne-Claire Haury
C Ambroise
C Fan
C Lai
C Sotiriou
F Reyal
G Abraham
H Zou
I Guyon
J Bi
J Mairal
J Wang
Jean-Philippe Vert
JPA Ioannidis
L Ein-Dor
M Dai
Muy-Teck Teh
N Meinshausen
P Wirapati
Pierre Gestraud
R Kohavi
R Shen
R Simon
R Tibshirani
RA Irizarry
S Michiels
T Abeel
T Barrett
T Iwamoto
W Shi
Y Benjamini
Y Pawitan
Y Wang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 23/06/2011

Motivation: Biomarker discovery from high-dimensional data is a crucial problem with enormous applications in biology and medicine. It is also extremely challenging from a statistical viewpoint, but surprisingly few studies have investigated the relative strengths and weaknesses of the plethora of existing feature selection methods. Methods: We compare 32 feature selection methods on 4 public gene expression datasets for breast cancer prognosis, in terms of predictive performance, stability and functional interpretability of the signatures they produce. Results: We observe that the feature selection method has a significant influence on the accuracy, stability and interpretability of signatures. Simple filter methods generally outperform more complex embedded or wrapper methods, and ensemble feature selection has generally no positive effect. Overall a simple Student's t-test seems to provide the best results. Availability: Code and data are publicly available at http://cbio.ensmp.fr/~ahaury/
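As an illustration of the simple filter that this abstract reports as performing best, the sketch below ranks features by the absolute value of a two-sample Student's t statistic (pooled variance) and keeps the top k. It is a minimal example under assumed names and toy data, not the authors' code, which is available at the URL given in the abstract.

```python
import math
from statistics import mean, variance

def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))

def select_top_k(group_a, group_b, k):
    """Return the indices of the k features with the largest |t|.
    group_a and group_b are lists of rows, one row per sample."""
    n_features = len(group_a[0])
    scores = [
        abs(student_t([row[j] for row in group_a],
                      [row[j] for row in group_b]))
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]

# Hypothetical expression values: feature 0 differs between groups, feature 1 is noise.
good_prognosis = [[2.0, 1.0], [2.2, 0.0], [1.8, 2.0]]
poor_prognosis = [[1.0, 1.1], [1.1, 0.1], [0.9, 1.9]]

if __name__ == "__main__":
    print("selected feature indices:", select_top_k(good_prognosis, poor_prognosis, 1))
```

Because each feature is scored independently, the cost grows linearly with the number of features, which is part of what makes filters of this kind attractive for high-dimensional expression data.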

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

HAL Descartes

HAL-MINES ParisTech

The identification of informative genes from multiple datasets with increasing complexity

Author: AH Fielding
Allan Tucker
BC Haynes
C Zhang
D Grossman
D Heckerman
D Madigan
DM Chickering
DR Rhodes
E Segal
G Schwarz
H Ma
J Bockhorst
J Pearl
J Su
JB Tobler
JM Peña
KK Tomczak
KP Murphy
M Miron
M Stone
N Friedman
Peter AC 't Hoen
R Jelier
R Kohavi
R Mac Nally
RA Irizarry
S Iezzi
S Yahya Anvar
SS Shen-Orr
TI Lee
TVan den Bulcke
W Lam
WL Buntine
X Xu
Y Cao
Y Lai
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Background: In microarray data analysis, factors such as data quality, biological variation, and the increasingly multi-layered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selection of a Bayesian classifier with an appropriate level of complexity by evaluation of predictive performance on independent data sets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) functional analysis of the informative genes. Results: In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and select the most informative. We also show that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004), since the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets with increasing complexity whilst additionally modelling the interaction between genes. Conclusions: We show that Bayesian networks derived from simpler controlled systems have better performance than those trained on datasets from more complex biological systems. Further, we show that highly predictive and consistent genes, from the pool of differentially expressed genes, across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Leiden University Scholarly Publications

Brunel University Research Archive

Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data

Author: Axel Benner
C Chang
D Jones
DB Allison
E Dimitriadou
F Markowetz
G Fung
Grischa Toedt
H Froehlich
H Zou
HH Zhang
I Guyon
I Inza
J Fan
J Quackenbush
JC Hsu
JD Hoheisel
JD Storey
L Wang
LJ van't Veer
M Greiner
M Johannes
MJ van de Vijver
N Becker
Natalia Becker
Peter Lichter
PS Bradley
Q Liu
R Kohavi
R Tibshirani
T Hastie
V Vapnik
W Gu
X Li
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection, and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which, in comparison to a fixed grid search, finds a globally optimal solution more rapidly and more precisely. Results: Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often predicted better than Elastic Net in terms of misclassification error. Finally, we applied the penalization methods described above to four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations. Conclusions: The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. We were the first to demonstrate that integrating the interval search algorithm with penalized SVM classification techniques provides fast solutions for the optimization of tuning parameters. The penalized SVM classification algorithms, as well as fixed grid and interval search for finding appropriate tuning parameters, were implemented in our freely available R package 'penalizedSVM'. We conclude that the Elastic SCAD SVM is a flexible and robust tool for classification and feature selection tasks for high-dimensional data such as microarray data sets.
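To make the "combination of SCAD and ridge penalties" concrete, the sketch below evaluates the standard SCAD penalty of Fan and Li for each weight and adds a squared-norm (ridge) term. It is only an illustration: the function names, the default a = 3.7 and the exact scaling of the ridge term are assumptions of this example and may differ from the paper and from the 'penalizedSVM' package, whose API is not reproduced here.

```python
def scad_penalty(w, lam, a=3.7):
    """Standard SCAD penalty for a single weight w (requires a > 2).
    Linear near zero, quadratic in between, constant for large |w|."""
    x = abs(w)
    if x <= lam:
        return lam * x
    if x <= a * lam:
        return -(x * x - 2.0 * a * lam * x + lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0

def elastic_scad_penalty(weights, lam1, lam2, a=3.7):
    """SCAD penalty summed over all weights plus a ridge term.
    The lam2 * ||w||^2 scaling is an assumption of this sketch."""
    scad_part = sum(scad_penalty(w, lam1, a) for w in weights)
    ridge_part = lam2 * sum(w * w for w in weights)
    return scad_part + ridge_part

if __name__ == "__main__":
    weights = [0.0, 0.5, 2.0, 10.0]
    print("SCAD only    :", elastic_scad_penalty(weights, lam1=1.0, lam2=0.0))
    print("Elastic SCAD :", elastic_scad_penalty(weights, lam1=1.0, lam2=0.1))
```

The SCAD part stops growing once a weight exceeds a * lam1, so large coefficients are not shrunk further, while the ridge part keeps penalizing them; this is one way to read the abstract's claim that the combination avoids the sparsity limitations of SCAD alone.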

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central